Abstract

The WWW Entrez server is a WWW interface based upon the National Center for Biotechnology Information's Entrez retrieval database and
software. Entrez is a molecular sequence retrieval system which provides an integrated view of portions of MEDLINE and of all publicly available
nucleotide and protein databases, including GenBank.

While Entrez was already available both on CD-ROM and as a true standalone client/server application ("Network Entrez"), a WWW server was
developed to obtain the advantages of the World Wide Web. These advantages include the wide availability of WWW clients, support for "dumb"
terminals, which Entrez does not support, and the benefits of client software which is supported by others (e.g., NCSA Mosaic).

The availability of alternate implementations of Entrez provides us with a rare opportunity to investigate the strengths and weaknesses of the WWW. 

In particular, we can compare the ergonomics, performance and bandwidth requirements of WWW Entrez to those of the true client/server 
implementation. In addition, the relative ease of integrating the two network services into other molecular biology software is discussed. Finally, we 
present usage statistics for WWW Entrez, the Entrez CD-ROM, and Network Entrez; the degree to which these services are actually used may provide 
the ultimate measure of effectiveness. 

History

Like many WWW servers, the WWW Entrez server had an earlier incarnation as a retrieval system. Unlike most such systems, however, Entrez also 
existed as a true client/server Internet application. 

Entrez CD-ROM 

In 1992, the National Center for Biotechnology Information (NCBI) began distributing the Entrez CD-ROM subscription, issued six times annually. 

The CD-ROM subscription, containing both data and software, provides an integrated view of the public DNA and protein databases, as well as a 
related subset of the MEDLINE medical literature database. 

The volume of information which molecular biologists need to access to support their experimental work is growing at an explosive rate. The scientific
literature adds some 6,000 peer-reviewed articles per month. Databases which contain the codes for gene sequences double in size every 20 months.
Biologists need a retrieval tool which offers easy, accurate and complete access to this type of information, and this was the motivation for developing
Entrez. Entrez is a search tool for integrated access to the biological literature and sequence data [3].

In addition to conventional indexed lookup techniques and the ability to link among the three databases, Entrez also provides the ability to link within 
a database by finding "related" sequences or literature abstracts. These neighboring relationships within the nucleotide database are pre-computed by 
comparing the entire nucleotide database against itself using the BLAST [1] sequence similarity algorithm. Proteins are also neighbored using BLAST,
and the literature abstract database is neighbored against itself using a statistical text retrieval algorithm f^. 







WWW Entrez: A Hypertext Retrieval Tool for Molecular Biology
http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/BioChem/epstein/WWW_entrez.html
Wednesday, November 28, 2001

Network Entrez 

A network-based retrieval mechanism for the Entrez CD-ROM data was needed, since the CD-ROMs had the following limitations:

• Slow access speed

• Data currency: the data is only as current as the most recent issue

• Capacity is limited to 660 Mbytes per CD-ROM.

Because of these limitations, Network Entrez was written and released in mid-1993. Network Entrez is a true client/server software package where
the client and server transmit and receive data according to a well-defined Abstract Syntax Notation 1 (ASN.1 [5]) protocol, and only small amounts of
data are sent across the network to facilitate smooth software operation. The Network Entrez client obtains a connection with a suitable Network Entrez
server by first connecting to a central Dispatcher at a well-known address. The use of the client<->Dispatcher protocol, also an ASN.1 protocol,
results in the Dispatcher instructing a suitable server to connect to the client. The desired client<->server connection is thereby established.
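
The connection handshake described above can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions: the class names, the "first idle server" selection policy, and the example address are hypothetical, and no ASN.1 encoding or real networking is modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Stand-in for a Network Entrez server (hypothetical model)."""
    name: str
    busy: bool = False
    clients: list = field(default_factory=list)

    def connect_to(self, client_address: str) -> str:
        # The server, not the client, opens the final connection.
        self.clients.append(client_address)
        return self.name


@dataclass
class Dispatcher:
    """Stand-in for the central Dispatcher at a well-known address."""
    servers: list

    def request_service(self, client_address: str) -> str:
        # Assumed policy: hand the request to the first idle server.
        for server in self.servers:
            if not server.busy:
                server.busy = True
                return server.connect_to(client_address)
        raise RuntimeError("no server available")


dispatcher = Dispatcher([Server("server-a", busy=True), Server("server-b")])
connected = dispatcher.request_service("client.example.org:4000")
print(connected)
```

The point of the indirection is that the client only needs to know the Dispatcher's address; which server answers is decided centrally.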

The communications protocol provides a desirable layer of abstraction between the client and server. In practice, there are two different Network Entrez 
servers which use different types of underlying databases, but understand the same ASN.1 protocol.

To reduce NCBI's software support requirements and to provide local support to users, a Network Entrez administrator at each site takes responsibility 
for local users. As of this writing (September 1994), there are over 800 registered Network Entrez sites in 32 countries. 

WWW Entrez 

While Network Entrez opened access to Entrez for a large number of Internet users, there were several motivations for building a WWW version of 
Entrez: 

• Serve vt-100 class users who previously did not have access to Entrez, since Entrez and Network Entrez are window-based applications. Note
that lynx, for example, is a WWW browser which does not require a windowed environment. [Note: a vt-100 Entrez navigator, CLEVER, was
concurrently developed independently of WWW Entrez.]

• The linking and neighboring information available in Entrez can be naturally expressed as hypertext 

• The universality of the World Wide Web 

• WWW browsers are supported by third-party developers

• The ability to link to external data sources on the Web. 

The WWW Entrez server was constructed using a combination of Bourne shell scripts and a single C program. This C program, entrcmd, is a 
search engine which can perform powerful retrievals using a simple UNIX-command-line query language. For example, entrcmd can look up the 
entries associated with a Boolean expression, and then perform an arbitrary number of rounds of inter-database linking and intra-database neighboring 
using the MEDLINE, protein, and nucleotide databases. Also, given a starting term (e.g., "Jones JD") it can fetch a number of consecutive terms from 
a given field (e.g., 50 author names). In its role as the WWW server search engine, entrcmd runs only on a UNIX host; however, it is layered on top
of the NCBI toolbox, which is portable across a wide variety of platforms. The source code for entrcmd appears within the NCBI toolbox, and can
run on all the NCBI-supported platforms.
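
The idea of applying several rounds of linking and neighboring to a starting set of entries can be sketched as repeated lookups in precomputed tables. The following is an illustration only: it does not reproduce entrcmd's query language or data structures, and the table contents, identifiers, and function names are invented for the example.

```python
# Toy precomputed tables keyed by (source database, target database).
# All identifiers below are invented for illustration.
NEIGHBORS = {
    ("medline", "medline"): {"m1": {"m2"}, "m2": {"m1", "m3"}},
    ("medline", "nucleotide"): {"m1": {"n1"}, "m2": {"n1", "n2"}},
    ("nucleotide", "protein"): {"n1": {"p1"}, "n2": {"p2"}},
}


def expand(ids, source_db, target_db):
    """One round: collect every entry reachable from ids in target_db."""
    table = NEIGHBORS.get((source_db, target_db), {})
    result = set()
    for uid in ids:
        result |= table.get(uid, set())
    return result


def run_rounds(start_ids, start_db, rounds):
    """Apply each round in order; a round naming the current database
    is a neighboring step, any other round is an inter-database link."""
    ids, db = set(start_ids), start_db
    for target_db in rounds:
        ids = expand(ids, db, target_db)
        db = target_db
    return db, ids


final_db, final_ids = run_rounds(
    {"m1"}, "medline", ["medline", "nucleotide", "protein"]
)
print(final_db, sorted(final_ids))
```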

Ergonomics 

Ergonomics of Entrez 

The ergonomics of the WWW Entrez server were intended to be similar to the existing Entrez programs, wherever possible. In Entrez (both CD-ROM 
and Network versions), a user can perform an initial Boolean query by selecting: 

• a database (MEDLINE, protein, or nucleotide), 

• a field (e.g., "Author Name" or "Journal Title"), and 

• a mode which allows query terms to be viewed in different ways (e.g., "selection mode", which allows the user to scroll through all the 
available terms, alphabetically). 

Having selected one or more terms in this manner, the user may perform simple Boolean operations by clicking and using drag-and-drop. More 
complex Boolean queries can be performed using a combination of point-and-click and typing. 

In Entrez, each toggle of a popup (e.g., database or field) and each entry of a datum results in an action by the Entrez application program. 

Furthermore, continuous scrolling of huge alphabetical lists is possible, since moderate scrolling can result in the fetching of only a small number of 
terms. Experience shows that this scrolling ("selection mode") is a valuable feature of Entrez, since users don't always know the exact query term or 
author name. 

Once a Boolean query has been composed by the user, the user can fetch and view the resulting document summaries. Again, these document 
summaries appear as a scrolled list, and only a few entries need to be retrieved at a time. The user may then view an entire entry by double-clicking on 
a document summary of interest. 

Links to other databases or neighboring within the current database may be performed by selecting the target database and marking one or more 
articles to be used as the basis for linking/neighboring. For example, by marking several MEDLINE articles a user can retrieve the related sequence 
data from the DNA database. 

Ergonomics of WWW Entrez 

In WWW Entrez, an attempt has been made to preserve as much of the familiar look of Entrez as possible, without wasting unreasonable amounts of 
network bandwidth. WWW Entrez is inherently slower than Network Entrez. This is because: 

• The portion of the WWW Entrez server which uses the entrcmd engine is written using Bourne shell scripts. If it were written, for example, 
using PERL, simple queries would run several seconds more quickly. 

• The statelessness of WWW Entrez requires that the Entrez database be re-initialized each time a URL is fetched. This adds approximately 
one second to each query. Furthermore, when several rounds of neighboring have been performed, recomputation is necessary. This 
computation is specific to WWW Entrez and is not performed by Entrez/Network Entrez. Again, this is due to the statelessness of the WWW 
browser/server interaction. 

• More bandwidth is required to perform analogous operations on WWW Entrez, since Network Entrez is optimized to transfer only the 
unformatted data which is required (formatting is performed by the client), while WWW Entrez must transfer formatted data. 

• When retrieving a large amount of text, Network Entrez only retrieves the data surrounding the records that are currently visible, while WWW
Entrez must retrieve all of the formatted data. 
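
The difference between the two retrieval styles in the last point can be made concrete with a small sketch. The function names, the margin of two extra records, and the record contents are assumptions chosen for illustration; they are not taken from either implementation.

```python
def fetch_window(records, first_visible, visible_count, margin=2):
    """On-demand style: return only the visible records plus a small margin."""
    start = max(0, first_visible - margin)
    stop = min(len(records), first_visible + visible_count + margin)
    return records[start:stop]


def fetch_all(records):
    """Whole-document style: every record is transferred."""
    return list(records)


# Illustrative data: 1000 placeholder document summaries.
records = [f"summary {i}" for i in range(1000)]

windowed = fetch_window(records, first_visible=100, visible_count=10)
full = fetch_all(records)
print(len(windowed), len(full))
```

With ten visible records and a margin of two on each side, the windowed fetch moves 14 records where the full fetch moves all 1000.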

WWW Entrez discussion 

Some of the limitations of WWW and HTML include: 

• a "click" can only result in an action when the user clicks on a piece of hypertext or presses a form submission button. Adjusting values on a 
FORM does not result in any action until the form submission button is pressed. 

• such a click results in a new page being fetched, rather than modifications to the current display 

• display of a large document in an on-demand fashion is impossible; the document must be fetched in its entirety. 

In addition to these constraints, clients with FORMS capability were rare in the fall of 1993, when the WWW Entrez server was written. Even at the 
time of this writing (September 1994), reliable FORMS-capable WWW browsers are not ubiquitous. Therefore, it is desirable to have servers which 
can take advantage of FORMS, but can still be useful to users who lack FORMS-capable browsers. WWW Entrez includes both FORMS-based and 
non-FORMS-based interfaces. 

Given the constraints of non-FORMS HTML, the simplest approach was to decompose the Entrez query interface into many screens, each 
corresponding to a traditional hierarchical menu system. Thus, for example, to look up a MEDLINE article by author name, one first selects
"MEDLINE" from a top-level menu, then selects "Author Name" from a menu, and finally makes a query using the browser's search window. 

Given the power of FORMS, it is possible to express a complex Boolean query on a single form. The sample MEDLINE form which appears below 
has considerable expressive power, allowing the Boolean composition (union, intersection, and set subtraction) of several terms, where each term may 
use a separate indexed field. 

Because of this unified approach, some users have stated that they prefer this query interface to that provided by Entrez/Network Entrez. 
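
The Boolean composition of fielded terms described above maps directly onto set operations. The sketch below assumes a toy inverted index, hypothetical field names ("author", "keyword", "journal"), invented document numbers, and strict left-to-right evaluation; the actual form may combine terms differently.

```python
# Toy inverted index: (field, term) -> set of document numbers (invented).
INDEX = {
    ("author", "jones jd"): {1, 2, 5},
    ("keyword", "adenoviridae"): {2, 3, 5},
    ("journal", "cell"): {5, 8},
}


def lookup(field, term):
    """Return a copy of the posting set for one fielded term."""
    return set(INDEX.get((field, term.lower()), set()))


def evaluate(first, operations):
    """Combine terms left to right with union, intersection, or subtraction."""
    result = lookup(*first)
    for op, field, term in operations:
        ids = lookup(field, term)
        if op == "and":
            result &= ids      # intersection
        elif op == "or":
            result |= ids      # union
        elif op == "butnot":
            result -= ids      # set subtraction
        else:
            raise ValueError(f"unknown operator: {op}")
    return result


hits = evaluate(
    ("author", "Jones JD"),
    [("and", "keyword", "adenoviridae"), ("butnot", "journal", "Cell")],
)
print(sorted(hits))
```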

As an option on the single FORM, a user may choose to "browse" from 50 of the available terms rather than typing an exact term. For example, if a
user doesn't remember how to spell "adenoviridae", typing "adeno" and using the "browse" option is helpful. This is also especially useful when
searching by author name. Note that this is a partial emulation of Entrez's "selection mode", described earlier.
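
One simple way to picture the "browse" option is a binary search into an alphabetically sorted term list, followed by taking the next block of terms. This is a sketch, not the server's implementation: the term list is invented, and only the window size of 50 comes from the text.

```python
import bisect

# Invented sample of indexed terms, kept in sorted order.
TERMS = sorted([
    "adenine", "adenoma", "adenosine", "adenoviridae",
    "adenovirus", "adenylate cyclase", "adrenal glands",
])


def browse(terms, start, count=50):
    """Return up to count consecutive terms beginning at the first
    term that sorts at or after the (possibly partial) start string."""
    position = bisect.bisect_left(terms, start.lower())
    return terms[position:position + count]


window = browse(TERMS, "adeno", count=3)
print(window)
```

Because the lookup positions the user in the list rather than demanding an exact match, a partial spelling is enough to reach the intended term.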

Hypertext is also a natural way to represent the hierarchical organism taxonomy which was added to the Entrez CD-ROM in May 1994, and has been 
found to be particularly useful to molecular phylogenists and other systematists [4]. This feature was subsequently added to WWW Entrez.

The visual clarity of hypertext permits the coherent presentation of rich text, compensating for other disadvantages of WWW and HTML. For 
example, the number of neighboring MEDLINE, protein, and nucleotide articles is presented to the user in WWW Entrez, smoothing his/her access to 
the available data. If these counts were computed and displayed in Entrez/Network Entrez, the screen would be cluttered, and, because of the extra time 
required to perform the computations, Entrez's otherwise quick performance would be compromised. 

The power of the entrcmd engine allows each Entrez query to be stateless, even though a user may perform several rounds of neighboring and linking
between and within the MEDLINE, protein, and nucleotide databases. This statelessness requires the recomputation of some results which could be
stored, but the simplicity and power which statelessness provides are considerable. Each WWW Entrez URL is an encoding describing an originating
Boolean expression or set of unique identifiers, along with the "rounds" of neighboring and linking which have been performed. 
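
The principle of carrying the whole query history in the URL can be sketched as follows. The URL layout, parameter names, and base address are hypothetical and are not the encoding actually used by WWW Entrez; the sketch only shows that the server can rebuild everything it needs from the URL alone.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical base address, used only for this illustration.
BASE = "http://example.org/entrez/query"


def encode_state(db, expression, rounds):
    """Pack the starting database, Boolean expression, and the list of
    neighboring/linking rounds into a single self-describing URL."""
    params = {"db": db, "expr": expression, "rounds": ",".join(rounds)}
    return BASE + "?" + urlencode(params)


def decode_state(url):
    """Recover the same state on the server side, with no session storage."""
    query = parse_qs(urlsplit(url).query)
    rounds = query["rounds"][0].split(",") if "rounds" in query else []
    return query["db"][0], query["expr"][0], rounds


url = encode_state("medline", "Jones JD & adenoviridae", ["medline", "nucleotide"])
state = decode_state(url)
print(url)
print(state)
```

The cost noted in the text is visible here: each fetch must replay every round from the decoded state, since nothing is remembered between requests.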

Comparison of three versions of Entrez 

Usage

The usage of WWW Entrez and Network Entrez has grown dramatically during 1994, while the growth of the Entrez CD-ROM subscriptions has 
reached a plateau. 

Comparing usage of the three types of Entrez is complex because there is no way to know how many users are actually using the Entrez CD-ROM 
and its derivative hard-disk copies. Furthermore, there is no exact analogy between Network Entrez sessions and WWW Entrez URLs. However, a 
more detailed study suggests that an average Network Entrez session corresponds to roughly eight WWW Entrez URLs. 
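
The rough equivalence above allows URL counts to be put on the same scale as session counts. The sketch below uses the ratio of eight from the text; the input figure is an arbitrary illustrative number, not a measured value.

```python
URLS_PER_SESSION = 8  # rough ratio stated in the text


def session_equivalents(url_fetches, urls_per_session=URLS_PER_SESSION):
    """Convert a count of WWW Entrez URL fetches into the approximate
    number of Network Entrez sessions it corresponds to."""
    return url_fetches / urls_per_session


example_urls = 2000  # illustrative input only, not a reported statistic
equivalent = session_equivalents(example_urls)
print(equivalent)
```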



[Figure: WWW Entrez work-day usage]

[Figure: Network Entrez work-day usage]

Legend: hosts/day, users/day, sessions/day

[Figure: 3-way Entrez comparison]



Conclusions and future directions 

The rapid growth in use of all three types of Entrez shows that it is a powerful tool which meets the needs of the molecular biologist. 

In recent months, the plateau in Entrez CD-ROM subscriptions has been triggered by the wide availability of the two Internet-based versions. In some 
ways, this is not a bad thing, especially because the Entrez CD-ROM subscription expanded to two CDs in October 1993, and must expand to three 
CDs in October 1994, due to the dramatic growth in molecular sequence data. 

WWW Entrez and Network Entrez complement, rather than compete with, one another. WWW Entrez is useful for those who prefer a single software
tool and who can accept the slower performance. Network Entrez is critical for high performance, smoother ergonomics, and custom-written
applications which need pre-parsed data. Finally, for some users either Network Entrez or WWW Entrez fails to function for some technical reason.
In these cases, the NCBI staff has usually been able to refer the user to the alternate Internet-based solution.

The WWW Entrez server is able to profit from the interconnection capabilities of the World Wide Web. First of all, Entrez's own data is highly 
interconnected. Secondly, for some protein data where external information is available on the Web, WWW Entrez points to the external Expasy and
"Molecules R Us" servers. Finally, several servers point to the WWW Entrez server to obtain data, notably Expasy and the Baylor College of
Medicine's sequence annotation server. The latter is especially interesting because it provides a way to annotate sequence database entries without
modifying the original entry, and while pointing to the canonical Entrez sequence entry. 

In the future, the Entrez WWW server will include an interface to a powerful structural molecule visualization tool (RasMol), and will contain
daily-updated molecular sequence databases.